Abstract

Respondent-driven sampling is a survey method for hidden or hard-to-reach populations in 
which sampled individuals recruit others in the study population via their social links. The most 
popular estimator for the population mean assumes that individual sampling probabilities are
proportional to each subject’s reported degree in a social network connecting members of the 
hidden population. However, it remains unclear under what circumstances these estimators are 
valid, and what assumptions are formally required to identify population quantities. In this short 
note we detail nonparametric identification results for the population mean when the sampling 
probability is assumed to be a function of network degree known up to scale. Importantly, we
establish general conditions for the consistency of the popular Volz-Heckathorn (VH) estimator. 
Our results imply that the conditions for consistency of the VH estimator are far less stringent 
than those suggested by recent work on diagnostics for RDS. In particular, our results do not 
require random sampling or the existence of a network connecting the population. 
1 Introduction


Respondent-driven sampling (RDS) is a method for surveying hidden or hard-to-reach populations
such as sex workers or injection drug users (Heckathorn, 1997; Broadhead et al., 1998). Starting
with a group of initial subjects called “seeds”, respondents recruit others who are also members of
the study population by giving them “coupons” to present to the researcher. These new subjects are
interviewed, given coupons, and the process repeats. Many researchers have approximated RDS as
a sampling design in which the sampling probability for subject $i$ is proportional to their network
degree $d_i$ (Salganik and Heckathorn, 2004; Volz and Heckathorn, 2008; Gile and Handcock, 2010;
Gile, 2011). In particular, Salganik and Heckathorn (2004) and Volz and Heckathorn (2008) justify
this choice by modeling the recruitment process as a with-replacement random walk on a connected
population network, where only one coupon is given to each subject, recruitment is uniformly at
random from network neighbors, and each subject can be recruited infinitely many times. For an
RDS sample of size $n$, Volz and Heckathorn (2008) (hereafter VH) give the estimator


\[
\hat{\mu}_{VH} = \frac{\sum_{i=1}^{n} y_i d_i^{-1}}{\sum_{i=1}^{n} d_i^{-1}} \tag{1}
\]

where $y_i$ is the outcome of interest and $d_i$ is the degree of subject $i$.
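
Equation (1) is a ratio of inverse-degree-weighted sums, so it can be computed in a few lines. The following is a minimal sketch, not code from the original paper: the function name `vh_estimator` and the toy data are hypothetical, and reported degrees are assumed to be strictly positive.

```python
from typing import Sequence


def vh_estimator(y: Sequence[float], d: Sequence[float]) -> float:
    """Inverse-degree-weighted mean, following equation (1)."""
    if len(y) != len(d) or len(y) == 0:
        raise ValueError("y and d must be non-empty and of equal length")
    if any(di <= 0 for di in d):
        raise ValueError("all reported degrees must be positive")
    weights = [1.0 / di for di in d]  # d_i^{-1}
    return sum(yi * wi for yi, wi in zip(y, weights)) / sum(weights)


# Hypothetical toy sample: a binary outcome and a reported degree for five subjects.
y = [1, 0, 1, 1, 0]
d = [2, 4, 10, 5, 1]

estimate = vh_estimator(y, d)   # weights low-degree subjects more heavily
sample_mean = sum(y) / len(y)   # unweighted comparison
```

In this toy sample the subjects with outcome 1 tend to report higher degrees, so the weighted estimate falls below the unweighted sample mean.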
                          Network structure assumptions    Sampling assumptions

Random-walk model         Network size large (N ≫ n)       With-replacement sampling,
                                                           single non-branching chain

Remove seed dependence    Homophily sufficiently weak,     Enough sample waves
                          bottlenecks limited,
                          connected graph

Respondent behaviour      All ties reciprocated            Degree accurately measured,
                                                           random referral


Table 1: Assumptions listed by Gile et al. (2015) as requirements for the VH estimator. Reproduced
from their Table 1.


Several authors have expressed skepticism about RDS survey methodology in general and the
VH estimator in particular (Heimer, 2005; Johnston et al., 2008; Goel and Salganik, 2010; Gile and
Handcock, 2010; Salganik, 2012; White et al., 2012). Many alternative characterizations of the
recruitment process exist (Goel and Salganik, 2009; Gile and Handcock, 2010; Gile, 2011; Berchenko
et al., 2013; Crawford, 2014). Empirical studies have also cast doubt on the performance of the VH
estimator in real-world RDS datasets (Wejnert, 2009; McCreesh et al., 2012; Rudolph et al.).

A recent paper by Gile et al. (2015) presents diagnostics whose purpose is to help researchers
determine whether the assumptions often invoked to motivate the VH estimator (1) are met in empirical
RDS data. The diagnostics presented by Gile et al. (2015) address a particular class of motivating
assumptions about the structure of a hypothesized social network and the process by which new 
subjects are sampled. These assumptions, characterized by Gile et al (2015, pg. 3) as “required by 
the [VH] estimator,” are summarized in Table 1, reproduced from the original paper. 

In this short note, we give an alternative, nonparametric set of conditions under which the VH 
estimator is consistent, and note identification conditions for a generalization of the VH estimator. 
The conditions we articulate for consistency are restrictive and untestable, but they are nevertheless 
less stringent than the traditional model used to justify the VH estimator. Consistency of the VH 
estimator does not require random sampling or even the existence of a network connecting the 
members of the study population. Our results clarify the inferential challenges posed by RDS data, 
challenges beyond those of other non-probability samples. Importantly, however, our results suggest 
conditions that can be more generally implied by other generative models that may justify the VH 
estimator or variants thereof. 


2 Results 

Formally, consider a sequence of populations and samples converging weakly to a joint limit
distribution on the outcome, (reported) degree, and sample, denoted $(Y, D, S)$. Let the $E[\cdot]$ and
$\Pr[\cdot]$ operators refer to features of this limiting distribution. In RDS, we observe the empirical
joint distribution of the outcome $Y$ and degree $D$ conditional on the sampling indicator $S = 1$.
Without loss of generality, suppose that $Y$ has bounded support and that $D$ has support in the set
$\{1, \ldots, K\}$.

Condition 1 (Ignorability). For all $k$ such that $\Pr[D = k] > 0$,
$E[Y \mid S = 1, D = k] = E[Y \mid D = k]$ and $\Pr[S = 1 \mid D = k] > 0$.

Condition 2 (Knowledge of the Conditional Probability of Sampling). $\Pr[S = 1 \mid D = k] = c\,f(k)$,
where $f(\cdot)$ is known and $c$ is an unknown scale parameter; that is, the conditional sampling
probability is known up to scale.
Proposition 1. Given Conditions 1 and 2, the population mean is identified, with

\[
E[Y] = \frac{\sum_{k=1}^{K} E[Y \mid S = 1, D = k]\,\dfrac{\Pr[D = k \mid S = 1]}{f(k)}}
            {\sum_{k=1}^{K} \dfrac{\Pr[D = k \mid S = 1]}{f(k)}} \tag{2}
\]


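Proof. We can identify $E[Y \mid D = k]$ for each degree $k$ from Condition 1, since
$E[Y \mid D = k] = E[Y \mid S = 1, D = k]$. By Bayes' rule, we can identify each $\Pr[D = k]$ up to
scale directly from Condition 2 as

\[
\Pr[D = k] = \frac{\Pr[D = k \mid S = 1]\,\Pr[S = 1]}{\Pr[S = 1 \mid D = k]}
           = \frac{\Pr[S = 1]}{c}\,\frac{\Pr[D = k \mid S = 1]}{f(k)}. \tag{3}
\]

Then by the law of total expectation, and because $\sum_{k=1}^{K} \Pr[D = k] = 1$,

\[
E[Y] = \frac{\sum_{k=1}^{K} E[Y \mid D = k]\,\dfrac{\Pr[S = 1]}{c}\,\dfrac{\Pr[D = k \mid S = 1]}{f(k)}}
            {\sum_{k=1}^{K} \dfrac{\Pr[S = 1]}{c}\,\dfrac{\Pr[D = k \mid S = 1]}{f(k)}}
     = \frac{\sum_{k=1}^{K} E[Y \mid D = k]\,\dfrac{\Pr[D = k \mid S = 1]}{f(k)}}
            {\sum_{k=1}^{K} \dfrac{\Pr[D = k \mid S = 1]}{f(k)}}. \tag{4}
\]

The unknown factor $\Pr[S = 1]/c$ cancels from the numerator and denominator, and substituting
$E[Y \mid D = k] = E[Y \mid S = 1, D = k]$ yields (2).

□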
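
The identification formula in Proposition 1 suggests a plug-in estimator that replaces each population quantity with its sample analogue. The sketch below is illustrative only: the function name `generalized_estimator`, its arguments, and the toy data are hypothetical and do not come from the original paper. It also shows that rescaling $f$ leaves the estimate unchanged, which is why knowing $f$ only up to scale suffices.

```python
from collections import defaultdict
from typing import Callable, Sequence


def generalized_estimator(
    y: Sequence[float], d: Sequence[int], f: Callable[[int], float]
) -> float:
    """Plug-in version of the identification formula, using sample analogues."""
    if len(y) != len(d) or len(y) == 0:
        raise ValueError("y and d must be non-empty and of equal length")
    n = len(y)
    count = defaultdict(int)    # number of sampled subjects with degree k
    total = defaultdict(float)  # sum of outcomes among subjects with degree k
    for yi, di in zip(y, d):
        count[di] += 1
        total[di] += yi
    numerator = 0.0
    denominator = 0.0
    for k, n_k in count.items():
        mean_y_k = total[k] / n_k  # sample analogue of E[Y | S = 1, D = k]
        share_k = n_k / n          # sample analogue of Pr[D = k | S = 1]
        numerator += mean_y_k * share_k / f(k)
        denominator += share_k / f(k)
    return numerator / denominator


# Hypothetical toy sample.
y = [1, 0, 1, 1, 0, 1]
d = [2, 2, 4, 4, 4, 1]

est_linear = generalized_estimator(y, d, lambda k: k)        # f(k) proportional to k
est_scaled = generalized_estimator(y, d, lambda k: 3.0 * k)  # same f, different scale
est_sqrt = generalized_estimator(y, d, lambda k: k ** 0.5)   # a different assumed f
```

Here `est_linear` and `est_scaled` agree up to floating-point error, while `est_sqrt` generally differs because it encodes a different assumption about how sampling probability depends on degree.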


Given Proposition 1, consistency of the VH estimator follows directly from convergence of sample
analogues to population quantities.

Corollary 1. Given Conditions 1 and 2, the VH estimator is consistent for $E[Y]$ if $f(k) \propto k$.
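
The step from Proposition 1 to the corollary can be made explicit with a short worked derivation. The notation $n_k$ (the number of sampled subjects reporting degree $k$) and the hatted sample analogues are introduced here for illustration only and are not in the original text.

```latex
% Sample analogues, with n_k = #{ i : d_i = k } and n = \sum_k n_k:
%   \hat{E}[Y \mid S = 1, D = k] = n_k^{-1} \sum_{i : d_i = k} y_i,
%   \hat{\Pr}[D = k \mid S = 1]  = n_k / n.
% Substituting these into the identification formula with f(k) = k:
\[
\frac{\sum_{k} \left( \frac{1}{n_k} \sum_{i : d_i = k} y_i \right) \frac{n_k}{n k}}
     {\sum_{k} \frac{n_k}{n k}}
= \frac{\frac{1}{n} \sum_{k} \sum_{i : d_i = k} \frac{y_i}{k}}
       {\frac{1}{n} \sum_{k} \sum_{i : d_i = k} \frac{1}{k}}
= \frac{\sum_{i=1}^{n} y_i d_i^{-1}}{\sum_{i=1}^{n} d_i^{-1}}
= \hat{\mu}_{VH}.
\]
```

The sums over $k$ run over the degrees observed in the sample. Because any constant multiple of $f$ cancels in the ratio, the same algebra holds for $f(k) = ck$ with any $c > 0$.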


3 Discussion 


A variant of Condition 1 is usually assumed implicitly in statistical arguments in favor of the VH
estimator (Salganik and Heckathorn, 2004; Salganik, 2006; Volz and Heckathorn, 2008). Ignorability
is not empirically testable from RDS data alone, since researchers never observe
$E[Y \mid S = 0, D = k]$ for any $k$. While ignorability is a strong but common assumption imposed
for inference from non-probability samples, Condition 2 highlights the additional challenges posed by
RDS data. The researcher does not generally have knowledge of the population distribution of degree,
and thus ignorability with respect to degree is not sufficient to identify the population mean.
Specification of the conditional sampling probability in Condition 2 provides an alternative means for
identification, and has typically been the focus of researchers’ efforts to justify the VH estimator.
The random-walk argument serves to motivate the choice of $f(k) \propto k$ in the VH estimator, but is
not strictly necessary for its consistency. Condition 2 holds under any model implying that subjects
with higher reported degrees are more likely to be sampled, provided that $f(k) \propto k$
characterizes this relationship. Finally, we note that our results suggest that the VH estimator and
variants thereof may be appropriate even when diagnostics predicated on a more restrictive model
(e.g., Gile et al., 2015) fail.
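
The claim that no network or random walk is needed can be illustrated numerically. The simulation below is a sketch under stated assumptions, not an analysis from the original paper. The population size, the uniform degree distribution, the outcome model, and the constant `c` are all hypothetical. There is no network at all: each unit is sampled independently with probability `c * k`, depending only on its reported degree `k`, so both conditions hold by construction.

```python
import random


def vh_estimator(y, d):
    """Inverse-degree-weighted mean (the VH estimator)."""
    w = [1.0 / di for di in d]
    return sum(yi * wi for yi, wi in zip(y, w)) / sum(w)


def simulate(pop_size=200_000, max_degree=20, c=0.01, seed=0):
    """Sample units independently with Pr[S = 1 | D = k] = c * k; no network involved."""
    rng = random.Random(seed)
    pop_y, samp_y, samp_d = [], [], []
    for _ in range(pop_size):
        k = rng.randint(1, max_degree)  # reported degree D (hypothetical distribution)
        # The outcome depends on degree, so the unweighted sample mean is biased.
        yi = 1 if rng.random() < k / (2 * max_degree) else 0
        pop_y.append(yi)
        # Given degree, sampling is independent of the outcome (ignorability).
        if rng.random() < c * k:
            samp_y.append(yi)
            samp_d.append(k)
    true_mean = sum(pop_y) / len(pop_y)
    naive_mean = sum(samp_y) / len(samp_y)
    return true_mean, naive_mean, vh_estimator(samp_y, samp_d), len(samp_y)


true_mean, naive_mean, vh, n = simulate()
print(f"n = {n}, population mean = {true_mean:.3f}, "
      f"unweighted mean = {naive_mean:.3f}, VH = {vh:.3f}")
```

Under this design the unweighted sample mean overstates the population mean, because high-degree units are both oversampled and more likely to have the outcome, while the VH estimate should land close to the population value.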

Without knowledge of the characteristics of the unsampled subjects, neither Condition 1 nor 
Condition 2 has directly testable implications, and thus the value of any diagnostics must depend 
on further assumptions about the generative process. Under further parametric assumptions, some 
of the conditions listed in Table 1 might be sufficient to imply consistency of the VH estimator. A 
formalization of these assumptions as part of a generative model for the recruitment process would 
allow researchers to evaluate the statistical properties of diagnostics like those proposed by Gile
et al. (2015).